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Here we demonstrate the use of short-read massive sequencing systems to in effect achieve longer read 
lengths through hierarchical molecular tagging. We show how indexed and PCR- amplified targeted 
libraries are degraded, sub-sampled and arrested at timed intervals to achieve pools of differing average 
length, each of which is indexed with a new tag. By this process, indices of sample origin, molecular origin, 
and degree of degradation is incorporated in order to achieve a nested hierarchical structure, later to be 
utilized in the data processing to order the reads over a longer distance than the sequencing system originally 
allows. With this protocol we show how continuous regions beyond 3000 bp can be decoded by an Illumina 
sequencing system, and we illustrate the potential applications by calling variants of the lambda genome, 
analysing TP53 in cancer cell lines, and targeting a variable canine mitochondrial region. 

High- throughput sequencing instruments have continued to demonstrate increasing number of DNA bases 
decoded per run 15 . However, despite the tremendous success the new generation of sequencers have had 
in terms of cost and sequence throughput, the length and accuracy of the traditional Sanger based 
sequencers have remained a commonly accepted standard for validation 6,7 . In parallel, there has been a gradual 
shift in interest from single base pair variations, to copy number variation, structural variation and de-novo 
assembly approaches, increasing the need for long, high-quality sequences 8 " 12 . Shorter reads have difficulty 
mapping to low complexity regions and resolving variation involving the orientation of larger regions. 
Furthermore, the ability to phase variations directly from the data is relevant for many biological questions 
where haplotype information is unknown or limited. Phasing directly is limited by the read length of the 
sequencing system 13 . Short reads also pose great challenges for de novo assembly, although many algorithms 
for this problem have been developed 14 , they struggle to completely assemble even small bacterial genomes 15 . 

To overcome the problems of short reads, protocols for linking two reads together have been developed 1016 . 
Cloning-free methods for linking reads together over long stretches of DNA range from 2 to 20 kb with increasing 
need for input DNA for increasing insert sizes. Cloning-based libraries such as fosmid libraries can further 
increase the insert size 17 . By incorporating a known distance between two sequence reads during construction 
of the library, that information can be utilized when deciphering the structure of a known genome, ordering the 
contigs built from assembly, to span or aid mapping in low complexity regions. The drawbacks of these protocols 
are the consumption of vast amounts of DNA as well as laborious and manual steps limiting scalability and 
robustness. Further, although structural problems equal to or below the insert size of long insert libraries can 
theoretically be resolved, the gain for phasing remains limited since the sequence between the linked reads is 
unknown. For this reason, phasing often relies on a high polymorphism rate of the particular genome in order to 
link the variation 15 . Recently, the use of randomized nucleotides have been reported in conjunction with mas- 
sively parallel sequencing, in order to improve the quality of detection 18 " 21 . These studies have been based on short 
target amplification, and show the usefulness of randomized tagging to count the number of unique molecules. 
The usefulness of nested short-read libraries and tag-directed local assembly have been shown previously but has 
been limited in target length to 500 bp 22 . Very recently a long fragment dilution protocol have been used to enable 
whole genome haplotyping applicable for SNPs in human genomes 23 . Despite these advances, decoding a long 
continuous sequence remains a challenge. 

The procedure described in this report, Tile-seq, is a targeted PCR-based approach that can aid in structuring 
massively parallel sequencing reads and help low complexity spanning, but also links information over the whole 
target region thereby increasing the information obtained. This is achieved by spreading the shorter reads of the 
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sequencing system over the entire region, while keeping them 
indexed for their molecular origin. Also, the information of eight 
insert sizes is incorporated rather than one. Since the larger regions 
are indexed, the mapping or assembly is a local problem, and the 
varying insert size information provides help to resolve low com- 
plexity regions and structural variation. The outlined protocol for 
multiplex long range analysis supersedes that of traditional Sanger 
sequencing in terms of the regional length and complexity possible to 
decode, and also has potential for great accuracy or dynamic range in 
detection. Further, we show the principal utility of using randomized 
nucleotide tagging to index a large pool of molecules for quantifying 
and linking unique reads together. 

Results 

To show the general principle of the protocol, the lambda genome 
was targeted with 19 PCR amplicons. Each amplicon was indexed at 



one end of the amplicon (ID-tag) and amplified in a two-step PCR 
process (first ID incorporation, then amplification). The end con- 
taining the index was protected by introducing a 3 ' ssDNA overhang. 
The amplicons were exonuclease degraded leaving the ID -tag intact, 
sub -sampled at eight different time points (TP 1-8), circularized, and 
sequenced (Figure la). To show potential applications we targeted 
the TP53 gene (exon 2 - 9) in one continuous 3184 bp region in four 
cell lines. Four different ID -tags were included to distinguish four 
different sample origins (NA10831, U251, U20S, A431). We also 
targeted one highly variable, low complexity region in the canine 
mitochondrial genome. For the applications an additional rando- 
mized index was included and used for clonal identification (CID- 
tag) (Figure lb and d). 

Lambda genome. From the lambda genome six differences to the 
Genbank entry were detected by shotgun sequencing (Figure 2). We 
covered the genome using Tile-seq at 99.8% (93% with 5x) across the 
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Figure 1 | Tile-seq basic procedure, (a) Genomic DNA is targeted and amplified to include molecular barcodes. The barcodes are protected and 
subsequently exonuclease degraded. The reaction is sub-sampled and stopped at distinct intervals, and indexed with a tag specific for the time of sampling. 
The molecules are circularized, fragmented and enriched for the junction between degraded and indexed ends. The junctions of degraded and indexed 
ends are amplified and sequenced, (b) Representation of the two PCRs incorporating a randomized clonal ID sequence and a known sample ID sequence, 
(c) Example agarose gel of eight samplings during exonuclease degradation of an amplicon showing the decreasing pool length of each TP. (d) The 
hierarchical structure given by three incorporated tags, sample id-tag (ID), clonal ID-tag (CID), and time point-tag (TP). 
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Figure 2 | Lambda genome amplicons. Top: a schematic representation of the lambda genome and primer sites for the 19 amplicons in green triangles. 
The protected end was incorporated at the left side of these amplicons. The variations detected by sequencing the lambda genome (both by shotgun and 
Tile-seq) are shown in yellow below the reference. Middle: base composition in 20 bp sliding window (G/C blue, A/T green). Bottom: Tile-seq and 
reference shotgun coverage with the variants highlighted with arrows. The reference and results are displayed over two rows. 



different amplicons. Due to circularization bias, the protocol pro- 
duces lower coverage in the 3' end of each amplicon (i.e. the longer 
fragments). The sequence depth recovered in this experiment was on 
average 14 X for the entire genome providing acceptable variation of 
coverage (Figure 2). If high coverage in the 3' end were necessary, a 
separate PCR with reversed primer design would incorporate the 
protecting uracil in both ends. Some of the potential strengths of 
this method compared to traditional capillary sequencing, apart 
from longer analysis regions, are accuracy and resolution of vari- 
ants. At a 5X depth cut-off, all of the differences detected by shot- 
gun were identified without false positives (Figure 2). This shows the 
ability to pool long-range amplicons and call simple mutations at 

14 X average coverage, spanning a 50 000 bp region. 

Human targets. A long range PCR towards TP53, exons 2-9, was 
setup and used to amplify four different targets; a reference sample: 
NA10831, and three cancer cell lines: U20S, U251, and A431. The 
reference sample PCR product was also sequenced using an ampli- 
con shotgun approach, and the cell lines had been shotgun sequenced 
to average genome coverage of approximately 21 X for U20S and 

15 X for U251 and A431. For the TP53 region targeted in this study 
the average coverage for the shotgun cell line data was approximately 
12x. Tile-seq yielded 60X, 34X, 20X and 21X respectively for the 
four targets. Generally the shotgun data and the Tile-seq data 
concurred. For NA10831, Tile-seq detected heterozygous variants 
with excellent correlation to the stochastically sheared amplicon 
(Figure 3). In this particular case, the stochastic shearing produced 
lower coverage at exon 2 and 3 for the amplicon, where Tile-seq had 
reasonably even coverage across the region (Figure 3a). One hetero- 
zygous position was also verified by Sanger sequencing (Figure 3b). 
For U20S a triple thymine heterozygous deletion was supported by 
two high-quality reads in the shotgun data, not found in the Tile-seq 
data. Another variation based on seven good quality mappings was a 
heterozygous T/G (4 and 3 reads respectively, Figure 3a, 3c), which 
was not found in the Tile-seq data based on 26 X coverage. Sanger 
sequencing supported the Tile-seq call (Figure 3a, 3c) and likely the 



shotgun variants were errors from the low coverage of the region 
(three reads supporting the variant). The Sanger sequencing of the 
homopolymeric region could not resolve the dispute (Figure 3a, 3d) 
but in general show the disrupting effect of homopolymers for 
traditional sequencing. This illustrates the potential for using the 
described protocol for validating WGS variations across long 
distances containing difficult regions. Each time point tag (TP-tag) 
should approximately correlate to a certain region of mapping to the 
amplicon. A clear trend following the degradation can be seen, albeit 
with some background trailing towards shorter molecules (Figure 4, 
Supplementary Figure S2). We believe the common peak around the 
end of the amplicon could be residual activity of exonuclease III in 
the SI nuclease buffer. Although this activity should be limited, the 
circularization of shorter molecules should be favoured, causing the 
common short molecule peak. When the reference was modified to 
include a simulated 1 kb inversion, the information supplied by the 
TP -tags clearly indicated a structural discordance to the data ob- 
tained and the reference used (Figure 4). 

Mitochondrial sample. Sanger sequencing revealed seven variants of 
the amplicons, all detected by Tile-seq (Figure 5). The highly variable 
repeat region could not be resolved completely by assembly or map- 
ping, although certain positions seems to share common variants 
(Figure 5). As with homopolymers the low complexity highly vari- 
able region does not disrupt the calling of variants before or after it. 
Coverage for the mtDNA sample was the highest (1600X) but CID- 
redundancy was not enough to allow phasing. With an increased 
redundancy of barcodes in the final libraries clonal assembly for 
highly variable regions and phasing of haplotypes could be feasible. 

Discussion 

Chimeric constructs, most likely formed during the library PCR 
where each construct share a common hairpin circularization 
adapter (see discussion in supplementary material), are common 
in the raw data (—50%) but are reduced from filtering on agreeing 
TPs (joined during circularization). We have identified a slight risk of 
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Figure 3 | TP53 amplicon results, (a) Top: shows schematic representation of the amplicon spanning exon 2 to exon 9 (right to left) and primer sites for 
Sanger sequencing. The protected end was incorporated on the right side for this amplicon (exon 2). Base composition is shown in a 20 bp sliding window 
(G/C blue, A/T green). Coverage plots are shown in four parts, one for each sample, with the corresponding reference sequencing (Whole Genome 
Shotgun (WGS) for cell lines and PCR shotgun for NA10831). Variants are highlighted by colour, and positions referring to panel b-d are indicated in 
parenthesis, (b-d) Three Sanger validated regions, showing agreeing calls for Tile-seq (b and c, arrows) and a general difficulty in determining 
homopolymers (d, dotted box). 



false positives when only running one amplicon from multiple sam- 
ples in parallel, this due to the chimeras introduced during library 
PCR. In the experiments described above the amplicon diversity in 
pooling did not present this type of limitation. Potentially alternate 
polymerases for library amplification that could improve on this 
issue, as well as alternate adapter strategies for circularization. Al- 
though efficient, the transposase-based library amplification also 
introduced some bias towards the library adapters (Supplementary 
Figure S3 and S4) which is worth mentioning for similar protocols 
seeking to take advantage of this highly efficient way of constructing 
libraries (also see discussion in supplementary material). With this 
gained knowledge, the authors wish to suggest that adapter se- 
quences be designed with as large edit distance to the transposase 
recognition as possible. 

In this study, we report on a procedure for analyzing targeted 
regions of a genome. In principle, the procedure could also be applied 
on shotgun libraries where the tagging and amplification strategy 
would need to be slightly modified. The CID tag will for this approa- 
ch be used as the ID-tag, i.e. for identifying the target region. The 
challenge of a shotgun library is the extreme level of multiplexity 
(1.5ell molecules for 0.5 jag of 3 kb DNA) and the subsequent need 
for protocol efficiency and sequence depth. 

The least efficient step of the protocol is the circularization step. 
We show the adaptation of the protocol for 3 kb inserts, which is a 



relatively good yielding insert length commonly used by standard 
mate pair protocols. Since exonuclease III is not directly limited by 
DNA length, one could envision increasing the inserts to tens of kilo 
bases. The loss of yield in the circularization step can be compensated 
by increasing starting material, which however would require a 
higher number of CID -tags, and consequently sequencing depth to 
cover them. Even for the setup at hand the number of CID -tags 
incorporated require substantial sequencing depth to be covered 
multiple times, but longer inserts may be possible if coupled to an 
increased amplification efficiency to create more of each tag prior to 
circularization. This could potentially be realized by alternative amp- 
lification strategies such as multiple displacement amplification or 
cloning, or by favouring intra-molecular ligation by solid support or 
emulsions to increase efficiency. 

Previous library protocols have been able to reconstruct 500 bp 
continuous fragments from Illumina sequencing based on concatemer- 
ization and random shearing 22 . The dependence on concatemerization 
restrains the target length possible to reconstruct in this way. The 
continued development of Illumina sequencing chemistry has now 
reached 2X250 bp using a MiSeq instrument and therefore made this 
range of continuous sequence reads more easily attained. The existing 
technologies for long reads remain traditional Sanger sequencing, and 
more recently introduced Pacific Biosciences (PacBio) SMRT sequen- 
cing. Tile-seq handles 3 kb fragments, thereby producing the longest 
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Figure 4 | Time-point tag occurrence after mapping to the references. The mapped position of each respective TP-tag detected of each amplicon is 
plotted using a Gaussian kernel density approximation (R). Top left: lambda amplicon 2. Top right: the mitochondrial canine amplicon (TP8 omitted for 
canine amplicon due to the shorter amplicon length). Bottom left: the NA10831 amplicon. Bottom right: the NA10831 data mapped to a locally inverted 
TP53 reference sequence, illustrating the possibility to detect structural variation from TP-tag occurrence. 



most accurate continuous reads currently available. PacBio SMRT 
sequencing on average produce 2 kb reads with 85% accuracy, and 
the technology is not as commonly available as Illumina. 

Tile-seq could be applied to finishing of de novo genome assem- 
blies by simultaneously target a large number of assembly gaps 
formed e.g. during scaffolding with a 3000 bp mate pair library, 
ultimately filling each gap by local assembly of the Tile-seq data. In 
the same way structural variation breakpoints could be resolved from 
primer design influenced from the original detections. Moreover, 



being able to extend the target region from a few hundred base pairs 
as in 454 amplicon sequencing to 3000 bp in Tile-seq could provide a 
major improvement of the throughput and efficacy of phylogenetic 
and taxonomic studies. Presently the protocol is limited in sample 
throughput, but given the extreme sequence throughput of the 
sequencing systems available the protocol is still competitive. 
Sample preparation for the protocol described here is automated 
and high throughput, with the throughput-limiting step being the 
PCR. After PCR the samples are pooled and the procedure scale with 
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Figure 5 | Mitochondrial canine amplicon. Top shows a schematic representation of the amplicon. Left grey box indicate the region validated by Sanger 
sequencing and the variants detected shown below in yellow, right grey box show the highly variable repeat region (HVR). The protected end was 
incorporated on the left side of this amplicon. Middle: base composition in 20 bp sliding window (G/C blue, A/T green). Bottom: Tile-seq coverage with 
agreeing variants to Sanger calls (arrows), and additional non- validated variants by coloured columns in the coverage plot. The reference and results are 
displayed over two rows. 



the throughput of the sequencing instrument. Likewise, the major 
impact on cost for the library preparation is determined by the PCRs 
and not by subsequent ligation steps. This because of the protected 
ID -tag that enable direct pooling after PCR. In theory this protocol 
should be able to multiplex 50,000 amplicons per Illumina HiSeq 
2000 lane at high accuracy (100X coverage, Supplementary Table 
SI). With the data we present in this report, we estimate a feasible 
degree of multiplexing to be 100-200 representing about 0.1% of the 
theoretical throughput (Supplementary Table SI). A Tile-seq setup 
to sequence 96 amplicons with a length of 3000 bp would require 
almost 9 times less resources for an average of 40 X coverage per 
amplicon, as compared to Sanger sequencing stretches of 600 bp 
in each direction (Supplementary table S2). 

In conclusion, we report a protocol for controllable degradation 
and tagging of pooled genomic long-range targets, in order to pro- 
duce barcoded libraries suitable for short-read sequencing. Exonu- 
clease III has previously been used for cloning experiments in 
combination with Sanger sequencing, but not been setup for a clon- 
ing-free system adapted for massively parallel sequencing. We have 
shown that with a regular PCR approach tags can be incorporated to 
in multiple layers of information to supply target origin, molecular 
origin and positional origin. By utilizing these indices, virtual read 
lengths equal to the lengths of the targets can be achieved, surpassing 
that of traditional Sanger sequencing as well as the capacity of span- 
ning difficult region. We have showed PCR based tagging of nineteen 
3000-bp targets representing the lambda genome. Viewed as an 
improvement of read length, this constitutes a 30-fold improvement 
to the Illumina system used, and illustrates the principle of convert- 
ing short-read technologies into long-range analysis. We also illu- 
strated potential applications by including a variable target region of 
canine mtDNA as well as the major part of the TP53 gene for cancer 
cell lines that, in principle, could greatly improve either accuracy or 
dynamic range of detection. Further, we show that the protocol is 
particularly suited for low complexity regions and could potentially 
be used as a validation platform for indels and structural variants. 

Methods 

DNA sample preparation. As input DNA three different origins were used, lambda 
DNA (NEB), canine mitochondrial DNA 24 and human cell line DNA. The lambda 
genome was used as a control reference for no variation. The cell line DNA, U-2 OS 
(ATCC-LGC), U-251MG (Prof. Bengt Westermark, Uppsala University) and A-431 
(DSMZ), had been obtained previously and shotgun sequenced (Akan, P., et al. 



manuscript submitted). DNA from the NIGMS Human Genetic Cell Repository, ID 
NA10831 (Coriell Institute), was used as human reference material. 

PCR and index protection. Each target was amplified by a two-step PCR, and 
received a distinct index included in one of the inner primers (Figure lb, 
Supplementary Table S3 and S4). The indices have been described previously 25,26 , as 
well as the primers for TP53 exon 2 and 9 27 and the general primers for TP53 and 
Lambda 20 . Inner cycling, 94°C 5 min, [94°C 30s, 60°C 120 s, 72°C 3 min] X 
10 cycles, 72°C 5 min, 4°C hold, outer cycling the same but 30 cycles. Starting 
material for each PCR was approximately 200,000 molecules (assuming 
500 mitochondria per cell for canine sample). After the first PCR the samples were 
automatically purified using polyethylene glycol precipitation and magnetic bead 
capture as we described previously 28 . Outer cycling was performed with general 
handles incorporated from PCR1, but including a uridine modification in the indexed 
end. The modification was used after PCR2 by incubating 1 U USER (NEB) per ug 
input DNA for 30 minutes at 37°C, thereby protecting the indexed end from 
exonuclease degradation. 

Exonuclease III degradation. For robust performance an MBS Magnatrix 1200 
(NorDiag) automatic workstation was used to perform the enzymatic steps involved in 
degradation and sub- sampling of the libraries. The exonuclease III degradation 
procedure was motivated by previous published work 29,30 . Exonuclease III, 400U, 
(Promega) was incubated at 1 X of the supplied buffer of the manufacturer in 40 ul at 
22°C, and 5 ul were sampled for each time point. Time point 1 was mixed with the 
enzyme and directly sampled (incubated ca 20s during mixing), and the subsequent 
intervals between the time points were 185 seconds. The sampling interval was chosen 
by degrading a PCR product of a three kb lambda amplicon and determining the 
approximate length by agarose gel electrophoresis for different timed intervals 
(2.1 bp/s was the approximate speed determined, Supplementary Figure SI). The 
exonuclease reaction was stopped adding 15 ul SI nuclease reaction buffer, including 
SI nuclease 100 U/reaction (Promega). The salt concentration will inhibit further 
exonuclease III activity, and the reaction was kept cooled until sampling was complete. 
The samples were heated to 37°C for 30 minutes, where SI nuclease removed the 
remaining protruding ends from exonuclease III degradation. The reactions were 
stopped by adding 5 ul stop buffer to a final concentration of 5 mM EDTA. SI 
nuclease and exonuclease III were then heat- inactivated at 70°C for 20 minutes. 

Circularization. After degradation the samples, now eight time points (TPs), were 
kept separate and prepared for circularization using an adapted protocol from the 
literature 10 . This protocol was also automated using the MBS Magnatrix 1200 
automatic workstation to facilitate rapid processing. Each DNA manipulation step 
was followed by a PEG precipitation with bead capture as we have reported previously 
for preparing libraries for massive sequencing 28,31 . End polishing in IX T4 DNA 
ligase buffer, with 0.2 mM dNTPs, supplemented with an additional 0.1 mM ATP, 
5 U T4 DNA polymerase, 10 U T4 polynucleotide kinase, and 5 U Taq DNA 
polymerase (Fermentas). Methylation in IX EcoRI methyltransferase buffer, with 
320 uM SAM and 80 U EcoRI methyltransferase (NEB). Adapter ligation in 1 X quick 
ligase buffer, with 5 pmol hairpin adapter corresponding to that time point 
(Supplementary Table S5) and 2 ul quick ligase (NEB). Exonuclease treatment in 1 X 
of NEBuffer 4 with 10 U lambda exonuclease, 40 U of exonuclease I and 20 U of T7 
exonuclease. EcoRI digestion in 1 X NEBuffer 2 with 20 U EcoRI-HF (NEB). All the 
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above steps were performed in 50 ul volume and purified by PEG/CA purification to 
20 ul. TPs were pooled and circularized in IX NEBuffer 4, with 0.1 mM ATP and 
5 ul quick ligase for 16h and QIAquick (QIAgen) purified to 60 ul. Exonuclease 
treatment 40 U exonuclease I and 40 U of lambda exonuclease for 30 minutes at 37°C, 
and 20 minutes at 80°C, then purified by MinElute (QIAgen) to 15 ul EB. 

In vitro transposition, bead enrichment and amplification. The circularized 
constructs were fragmented by in vitro transposition ("Nextera", Epicentre) as 
described earlier for standard shotgun libraries 32 . All material from the previous step 
was incubated in the HMW buffer (Epicentre) in 20 ul at 55°C for 5 minutes 
according to the manufacturer's instructions. The samples were then purified using 
QIAquick columns (QIAGEN) to 50 ul in EB (QIAGEN). 25 ul M270 Streptavidin 
Dynabeads (Invitrogen) were washed and resuspended in 50 ul bind and wash buffer 
("BW", 5 mM Tris/HCl, 0.5 mM EDTA, 1 M NaCl pH 7.5). The 50 ul fragmented 
material was added and incubated for 30 min with agitation at room temperature. 
After binding the beads were washed 2 X 200 ul in B W, and 2 X 500 ul in EB. The 
beads were resuspended in 20 ul EB. PCR amplification was performed using all 
beads and Nextera compatible primers (Supplementary Table S6) at 200 nM and 1 X 
Phusion HF Master Mix (Fermentas) at 72°C for 3 minutes, 95°C for 30s [95°C for 
10s, 62°C for 30s, 72°C for 3 min] X 15 cycles, 4°C hold. 

Sequencing and data processing. The libraries were sequenced on an Illumina HiSeq 
2000 instrument using standard clustering and sequencing reagents according to the 
manufacturers instructions for paired end 101 bp sequencing. The sequences were 
processed using custom made python scripts. First all sequences containing an ID-tag 
and a CID-tag were classified by writing an identifier to the sequence name. Sequences 
not containing an ID-tag were removed. Second, all sequences containing two or one 
correct TP-tag were classified and moved on, sequences with non-matching TP-tags 
(chimeras) or unknown TP-tags or sequences missing the circularization adapter were 
discarded. Third, all reads where split into a file according to the ID-tag classification. 
All files were mapped to the respective amplicon reference (GRCh37 for human, 
Genbank accession U96639.2 for canine mitochondrion, J02459.1 for Lambda). A 
script for duplication removal was written for this type of data where not only 
common mapping positions is prevalent but also common CID-tags. In the case where 
reads for a particular mapping position had identical read length and CID a 
duplication was classified, in which case the read with highest mapping score was kept, 
the others discarded. A local inversion structural variation was included by inverting 
position 939-1950 of the TP53 amplicon reference. 

Reference sequencing. The lambda genome had previously been prepared for a mate 
pair library according to the Illumina standard protocol for mate pair sequencing. The 
shotgun reads of this library were extracted for comparison in this study. Sample 
NA10831 was PCR amplified and fragmented using Covaris to approximately 
400 bp, then prepared for sequencing using the Ovation SP Ultralow System 
(NuGen) and sequenced on a MiSeq (Illumina) 2 X 151 bp, using standard reagents 
according to manufacturer's specifications. Sample NA10831 was also Sanger 
sequenced using an ABI 3730x1 (Life Technologies) on PCR products from primers 
targeting homopolymers (Figure 3, Supplementary Table S7). The canine 
mitochondrial sample was previously prepared and Sanger sequenced 2433 . 

Data visualization. Data were viewed and analyzed in either Geneious v.5.6 34 or IGV 
2 35 . Coverage plots and variations were called and settings are described as they are 
presented. 
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